arXiv:1508.06672v1 [q-bio.PE] 26 Aug 2015


THERE ARE NO CATERPILLARS IN A WICKED FOREST 

JAMES H. DEGNAN AND JOHN A. RHODES 


Abstract. Species trees represent the historical divergences of populations or 
species, while gene trees trace the ancestry of individual gene copies sampled 
within those populations. In cases involving rapid speciation, gene trees with 
topologies that differ from that of the species tree can be most probable under 
the standard multispecies coalescent model, making species tree inference more 
difficult. Such anomalous gene trees are not well understood except for some 
small cases. In this work, we establish one constraint that applies to trees 
of any size: gene trees with “caterpillar” topologies cannot be anomalous. 

The proof of this involves a new combinatorial object, called a population 
history, which keeps track of the number of coalescent events in each ancestral 
population. 

Keywords: gene tree, species tree, multispecies coalescent, anomalous gene tree, 
coalescent history, phylogeny


1. Introduction 


An important distinction is made in phylogenetics between species trees and
gene trees. Species trees describe the ancestral relationships between populations
of individuals (each carrying many genes) that have undergone divergences at
various times in the past. A gene tree tracks the ancestral relationships for a single
gene sampled from individuals within extant species populations. In a species tree,
the ancestral populations associated to edges have finite durations (see Figure 1).
As a result, going backwards in time, several gene lineages from sampled
individuals may remain distinct within a common ancestral population, a phenomenon
called incomplete lineage sorting (Maddison, 1997), and then merge with other
lineages to form a gene tree that is topologically dissimilar to the species tree. An
understanding of this phenomenon, which leads us to expect some, and possibly
many, gene trees to differ from the species tree, is essential to statistical approaches
to inference of species trees from genomic data sets.

Figure 1. A species tree with the matching gene tree (A,B,C)
under three different coalescent histories (out of 13 possible), and
a nonmatching caterpillar gene tree (D). Speciation events occur
when populations (shaded polygons) split into two new populations
going forward in time (downward). The population ancestral to
the root of the species tree (lightest shading) is assumed to extend
infinitely into the past; all other populations have finite durations.
The nodes of the trees are labelled in a postorder traversal using
large, boxed numbers for the species tree, and unboxed numbers
for the coalescent events. The vectors $\mathbf{h}$, $\mathbf{y}$ give coalescent histories
and population histories, respectively, as explained in Section 2,
using node labels as vector indices.

The multispecies coalescent model gives a stochastic description of gene tree
formation within a species tree. Kingman's coalescent model (Kingman, 1982;
Hudson, 1983; Tajima, 1983; Wakeley, 2008) is adopted for each population (edge)
of the species tree, so that the waiting time until coalescence between any pair of gene
lineages within a population, going backwards in time, is exponentially distributed
with mean 1. At each node of the species tree, gene lineages reaching it from its
descendant edges 'enter' the population above, starting a new coalescent process.
Combining calculations of probabilities for the within-population Kingman
coalescent process with combinatorial features of the species tree, it is possible to calculate
the probability of the formation of any topological gene tree (Degnan and Salter,
2005). A rooted species tree, with branch lengths, relating $n$ taxa thus determines
a probability mass function on the set of all $(2n-3)!!$ rooted topological gene trees
defined on the same species.
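
To give a sense of how quickly this set of gene tree topologies grows, the following minimal sketch evaluates $(2n-3)!!$ for a given number of taxa. The function name is ours and is not taken from any cited software.

```python
def num_rooted_topologies(n):
    """Number of rooted binary leaf-labeled topologies on n >= 2 taxa, i.e. (2n-3)!!."""
    if n < 2:
        raise ValueError("need at least two taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n = 2).
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 8):
        print(n, num_rooted_topologies(n))
```

For four taxa this gives the 15 topologies (12 caterpillars and 3 symmetric trees) among which anomalous gene trees are sought.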

Under this model, the most likely gene tree topology does not necessarily match
that of the species tree. For example, the species tree (((a, b), c), d), with choices
of appropriate branch lengths, can result in any of the symmetric gene tree
topologies, ((a, b), (c, d)), ((a, c), (b, d)), or ((a, d), (b, c)), being more probable than the
gene tree (((a, b), c), d). The term anomalous gene tree (AGT) is used to describe
gene trees that are more probable than the gene tree with the same topology
as the species tree. Although for four taxa, AGTs only arise for an asymmetric
species tree, for any species tree topology with five or more taxa there are
branch lengths (durations of internal populations) that lead to at least one AGT
(Degnan and Rosenberg, 2006).
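
The four-taxon example can be explored numerically. The sketch below is a Monte Carlo simulation of the multispecies coalescent with one lineage per taxon, not the exact computation of Degnan and Salter (2005); the tree encoding (nested tuples whose third entry is the length of the edge above the node), the function names, the branch lengths, and the sample sizes are all our own illustrative choices.

```python
import random
from collections import Counter

INF = float("inf")


def coalesce(lineages, duration, rng):
    """Run Kingman's coalescent on `lineages` for `duration` coalescent units."""
    lineages = list(lineages)
    elapsed = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # With k lineages the total coalescence rate is k choose 2.
        elapsed += rng.expovariate(k * (k - 1) / 2.0)
        if elapsed > duration:
            break
        i, j = sorted(rng.sample(range(k), 2))  # uniformly chosen pair
        right = lineages.pop(j)
        left = lineages.pop(i)
        lineages.append(frozenset([left, right]))
    return lineages


def sample_gene_tree(node, rng):
    """node is a leaf name, or (left, right, length of the edge above the node)."""
    if isinstance(node, str):
        return [node]
    left, right, length = node
    entering = sample_gene_tree(left, rng) + sample_gene_tree(right, rng)
    return coalesce(entering, length, rng)


def topology_counts(species_tree, n_samples, seed=0):
    """Tabulate sampled gene tree topologies (as nested frozensets)."""
    rng = random.Random(seed)
    return Counter(sample_gene_tree(species_tree, rng)[0] for _ in range(n_samples))


def caterpillar_species_tree(lam1, lam2):
    """Species tree (((a,b),c),d) with internal edge lengths lam1 and lam2."""
    return ((("a", "b", lam1), "c", lam2), "d", INF)


MATCHING = frozenset([frozenset([frozenset(["a", "b"]), "c"]), "d"])
SYMMETRIC = frozenset([frozenset(["a", "b"]), frozenset(["c", "d"])])

if __name__ == "__main__":
    counts = topology_counts(caterpillar_species_tree(0.01, 0.01), 20000, seed=1)
    print("matching:", counts[MATCHING], "symmetric:", counts[SYMMETRIC])
```

With both internal branches very short, the symmetric topology ((a, b), (c, d)) should appear noticeably more often than the matching caterpillar, while with long branches the matching tree dominates.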

Although this result describes the shapes of species trees that can have AGTs,
less is known about gene tree shapes that can be AGTs. For four taxa (Degnan and
Rosenberg, 2006), explicit computation of gene tree probabilities under the coalescent
showed that only symmetric gene trees can be AGTs. For five taxa (Rosenberg and
Tao, 2008), a computation showed that if the species tree is completely unbalanced, e.g.,
((((a, b), c), d), e), then any gene tree with a different unlabeled topology can be an
AGT. However, for five-taxon species trees of any topology, a completely
unbalanced gene tree is never an AGT. Furthermore, any noncaterpillar gene tree can
be an AGT for some species tree. For example, if the species tree is a caterpillar,
then any noncaterpillar gene tree is more probable than the matching gene tree if
all species tree branch lengths are sufficiently short (Degnan and Rosenberg, 2006).

We refer to completely unbalanced trees, such as ((((a, b), c), d), e) and its analogs
with more taxa, as rooted caterpillars, usually omitting the word "rooted" as this
paper only concerns rooted trees. We generalize the above observations by showing
that for species trees of any size, there are no AGTs with caterpillar topologies. This
also implies the statement chosen as the title of this paper, using the terminology
introduced in Degnan and Rosenberg (2006), which we restate in the next section.

While our results are theoretical, they have potential to contribute to the
practice of species tree inference. For instance, when different genes yield different
inferred phylogenetic trees, or different methods yield conflicting estimated species
trees, evolutionary biologists sometimes wonder if their inferred tree is an AGT
rather than the desired species tree (e.g., Castillo-Ramírez and González, 2008;
Zhaxybayeva et al., 2009). A recent paper uses a heuristic test based on taking
subsets of four taxa to conclude that there is evidence of the anomaly zone in a
skink phylogeny (Linkem et al., 2014). One implication of our results is that if a
phylogenetic method returns a caterpillar tree (as often happens with smaller
numbers of species), the empirical phylogeneticist can be sure that an AGT was
not inferred.


2. Notation and Definitions 

Let X denote a finite set, whose elements we refer to as taxa. By a tree on X 
we will mean a topological tree with leaves bijectively labeled by X. 

Definition 1. A species tree $\sigma = (\psi, \lambda)$ on $X$ is a rooted, binary tree $\psi$ on $X$
together with a vector $\lambda = (\lambda_1, \ldots, \lambda_{n-2})$ of internal edge lengths (weights), where
$n = |X|$, $\{e_1, \ldots, e_{n-2}\}$ are the internal edges of $\psi$, and $\lambda_i > 0$ is the length of $e_i$
for $i = 1, \ldots, n-2$.

Nodes of the species tree represent speciation events, and edges represent
populations extending over time. Edge lengths are given in coalescent units which (for
constant population size) are the ratio of elapsed time to population size. It is
convenient for the coalescent model to view $\psi$ as augmented by an additional directed
edge leading to its root, in order to refer to a population ancestral to the root. We
treat this edge as having infinite length, and consider it to be an internal edge of
the species tree.

The coalescent on a species tree $\sigma$ models the formation of gene trees by the
merging of ancestral lineages (going backwards in time) within the populations
represented by the tree's edges. We focus on the situation where one lineage is
sampled per taxon, so pendant edge lengths for the species tree would be irrelevant.
With this sampling scheme, a gene tree can also be leaf-labeled by $X$.

Since under the standard coalescent only binary gene trees have positive proba¬ 
bility of being realized, and we are interested solely in the topological form of these 
trees, we make the following definition. 

Definition 2. A gene tree, $T$, on taxa $X$ is a rooted binary tree on $X$.

Definition 3. Given a species tree $\sigma = (\psi, \lambda)$, the matching gene tree is the gene
tree $T_M$ isomorphic to $\psi$ as a leaf-labeled tree.


Though it is in some sense artificial to distinguish between $\psi$ and $T_M$, we do
so in order to keep clear the difference in viewpoint between the fixed topological
species tree $\psi$ and one of the possible states, $T_M$, of the gene tree random variable
under the coalescent model.

Probabilities of an event $E$ under the 1-sample per taxon coalescent model on
a species tree $\sigma$ are denoted $P_\sigma(E)$. In particular, the probability of a gene tree
$T$ is $P_\sigma(T)$. (See Degnan and Salter, 2005, for details on computations of such
probabilities.)

Definition 4 (Degnan and Rosenberg, 2006). A gene tree $T$ is said to be an
anomalous gene tree (AGT) for a species tree $\sigma = (\psi, \lambda)$ if $P_\sigma(T) > P_\sigma(T_M)$.


AGTs are significant, since their existence thwarts picking the most frequently
occurring gene tree in a sample as the estimate of the species tree (Degnan and
Rosenberg, 2006). Though intuitively appealing, this democratic vote method is not
statistically consistent. The following pathological situation is one where such voting is
particularly misleading, in that voting based on gene trees arising from several
species trees always ranks the true tree last.

Definition 5 (Degnan and Rosenberg, 2006). A wicked forest $W$ is a set of at least
two species trees, with distinct topologies but defined on the same set of taxa $X$,
such that for all $\sigma_i, \sigma_j \in W$ with $i \ne j$, the gene tree $T_j$ matching $\sigma_j$ is an AGT
for $\sigma_i$.


The first set of trees noticed to form a wicked forest had six taxa and was given
by Degnan and Rosenberg (2006). Their discovery was motivated by trying to find
examples of trees that were AGTs yet were less balanced than the matching tree.
Rosenberg and Tao (2008) fully characterized wicked forests for five-taxon trees,
the smallest number of taxa for which wicked forests exist. The maximum number
of trees that can form a wicked forest for $n > 5$ taxa is not known. An example
of a wicked forest with three trees is shown in Figure 2 and is based on swapping
two-taxon clades in the trees.

Figure 2. A wicked forest with three balanced 8-taxon species
trees. Branch lengths are drawn to scale with the total depth
of the tree equal to 0.11 coalescent units. For each species tree
$i \in \{I, II, III\}$, the two gene trees with the matching topology
for species tree $j \in \{I, II, III\} \setminus \{i\}$ are AGTs for species tree $i$.
Species tree I in Newick format is (((A:0.108,B:0.108):0.001,(C:0.009,
H:0.009):0.1):0.0010,((D:0.0797,G:0.0797):0.03,(E:0.0097,F:0.0097):
0.100):0.0003).
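
As a sanity check on the Newick string in the caption of Figure 2, the following sketch computes the root-to-leaf distance of every taxon, so that one can confirm that each equals the stated total depth of 0.11 coalescent units. The small parser is our own and handles only the plain `(child,child):length` form used here, not the full Newick grammar.

```python
def leaf_depths(newick):
    """Return {leaf name: distance from the root} for a simple Newick string."""
    s = "".join(newick.split()).rstrip(";")

    def read_length(i):
        # Optional ":<number>" following a node; the root has none.
        if i < len(s) and s[i] == ":":
            j = i + 1
            while j < len(s) and s[j] not in ",()":
                j += 1
            return float(s[i + 1:j]), j
        return 0.0, i

    def parse(i):
        below = {}
        if s[i] == "(":
            i += 1
            while True:
                child, i = parse(i)
                below.update(child)
                if s[i] == ",":
                    i += 1
                else:
                    break
            i += 1  # skip the closing parenthesis
        else:
            j = i
            while j < len(s) and s[j] not in ":,()":
                j += 1
            below[s[i:j]] = 0.0
            i = j
        length, i = read_length(i)
        # Add the length of the edge above this node to every leaf below it.
        return {leaf: depth + length for leaf, depth in below.items()}, i

    depths, _ = parse(0)
    return depths


SPECIES_TREE_I = (
    "(((A:0.108,B:0.108):0.001,(C:0.009,H:0.009):0.1):0.0010,"
    "((D:0.0797,G:0.0797):0.03,(E:0.0097,F:0.0097):0.100):0.0003)"
)

if __name__ == "__main__":
    for leaf, depth in sorted(leaf_depths(SPECIES_TREE_I).items()):
        print(leaf, round(depth, 6))
```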


To compute and compare the probabilities of various gene trees under the
coalescent model, we need further technical notions.

We treat all trees as directed graphs, with all edges directed away from the 
root (except, in species trees, for the “edge” ancestral to the root). Since we 
depict trees with the root placed above the leaves, we use terminology such as 
‘ancestral’ and ‘above,’ or ‘descendent’ and ‘below’ interchangeably to describe 
directed relationships of nodes and edges. 

Under the coalescent model on a species tree $\sigma = (\psi, \lambda)$ on $X$, all gene trees
$T$ on $X$ are realizable. That is, $P_\sigma(T) > 0$ for all $T$. To compute $P_\sigma(T)$ one
considers the various ways in which $T$ is realizable. This may be done at several
levels of detail. The most detailed non-metric characterization would be to specify
coalescent histories with in-population rankings, in which for each node of $T$ one
indicates an edge of $\psi$ on which the coalescent event that node describes occurred,
as well as an ordering of the coalescent events within each species tree edge. (These
are called instantiations of coalescent histories by Degnan and Salter (2005).)

A less detailed level is to specify coalescent histories, where the ranking of
coalescent events on edges is not recorded. This is the key notion used by Degnan and Salter
(2005) for the computation of gene tree probabilities (with adjustments for the count
of possible in-population rankings).

Finally, a population history is an even cruder summary. It records only the
number of coalescent events on each edge, but does not record which lineages
coalesced. To the best of our knowledge, this concept has not been used in previous
works studying species trees and gene trees, though it plays an essential role in our
arguments.

To formalize these notions, it is useful to encode the topology of a tree through
the ancestral relationships of its nodes. Let $V_T$ denote the set of nodes of a rooted
tree $T$ (either a gene or species tree), and $I_T \subset V_T$ the subset of internal nodes.
Let

$$\alpha_{ij} = \begin{cases} 1 & \text{if node } i \text{ is ancestral or equal to node } j, \\ 0 & \text{otherwise.} \end{cases}$$

This indicator function $\alpha$ on $V_T \times V_T$ fully encodes the topology of $T$. Labeling
the edges of $T$ by the label of their end nodes, $\alpha$ also gives indicators of ancestral
relationships between edges.


Definition 6 (Degnan and Salter, 2005). Let $\sigma = (\psi, \lambda)$ be a species tree and $T$
a gene tree on $X$, with $|X| = n$. A coalescent history for $T$ is an $(n-1)$-tuple
$\mathbf{h} = \mathbf{h}_T = (h_i)_{i \in I_T}$ with each $h_i \in I_\psi$ satisfying

(1) for all $i \in I_T$, the set of leaves descended from node $i$ of $T$ is a subset of the
set of leaves descended from node $h_i$ of $\psi$; i.e., for all leaf labels $k$, $\alpha_{ik} = 1$
implies $\alpha_{h_i k} = 1$, and

(2) if node $i$ is ancestral to node $j$ on $T$, then node $h_i$ is ancestral or equal to
node $h_j$ on $\sigma$; i.e., $\alpha_{ij} = 1$ implies $\alpha_{h_i h_j} = 1$.

The set of coalescent histories for a species tree with topology $\psi$ and a gene tree $T$
is denoted $H_{\psi,T}$.


Conceptually, such a history records that the coalescent event forming node $i$ of
the gene tree occurs in the population immediately above node $h_i$ of the species tree.
Condition (1) thus encodes the idea that coalescences must predate the most recent
common ancestor of the populations from which they were sampled. Condition (2)
ensures that the sequence of coalescences is consistent with the topology of the gene
tree; e.g., if a gene tree displays subtree ((a, b), c), then c cannot coalesce with (a, b)
in population $i$ unless a and b have coalesced either in population $i$ or one of its
descendant populations in the species tree.
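
The two conditions can be checked by brute force on a small case. The sketch below enumerates coalescent histories of the matching gene tree for the five-taxon species tree of Figure 1, using the postorder labels 1 to 4 of its internal nodes (1 and 2 the two cherries, 3 their parent, 4 the root); this parent map is our reading of the figure from the values of $\alpha$ and $\mathbf{h}$ quoted in the text. For a matching tree, condition (1) reduces to $h_i$ being ancestral or equal to $i$. The function names are ours.

```python
from itertools import product

# Internal nodes of the Figure 1 species tree in postorder; node 4 is the root.
# Because the gene tree matches, the same map describes its internal nodes.
PARENT = {1: 3, 2: 3, 3: 4, 4: None}


def ancestral_or_equal(i, j, parent):
    """True if node i is ancestral or equal to node j."""
    while j is not None:
        if j == i:
            return True
        j = parent[j]
    return False


def matching_coalescent_histories(parent):
    """All tuples (h_1, ..., h_{n-1}) satisfying conditions (1) and (2)."""
    nodes = sorted(parent)
    histories = []
    for values in product(nodes, repeat=len(nodes)):
        h = dict(zip(nodes, values))
        # Condition (1), matching-tree form: h_i is ancestral or equal to i.
        cond1 = all(ancestral_or_equal(h[i], i, parent) for i in nodes)
        # Condition (2): i ancestral to j on T implies h_i ancestral or equal to h_j.
        cond2 = all(
            ancestral_or_equal(h[i], h[j], parent)
            for i in nodes
            for j in nodes
            if i != j and ancestral_or_equal(i, j, parent)
        )
        if cond1 and cond2:
            histories.append(values)
    return histories


if __name__ == "__main__":
    print(len(matching_coalescent_histories(PARENT)))
```

The enumeration should return 13 histories, in agreement with the count given in the caption of Figure 1.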

A coalescent history can be viewed as an event under the coalescent model.
Moreover, $H_{\psi,T}$ gives a partition of the event that the gene tree is $T$ into disjoint
subevents $\mathbf{h}$. Although by definition $P_\sigma(T, \mathbf{h}) = P_\sigma(\mathbf{h})$, for clarity we prefer to include
the redundant reference to $T$ in this notation. Note that $P_\sigma(T, \mathbf{h}) > 0$ for every
$\mathbf{h} \in H_{\psi,T}$ (Degnan and Salter, 2005).


Definition 7. Let $\sigma = (\psi, \lambda)$ be a species tree on $X$, with $|X| = n$. A population
history for $\psi$ is an $(n-1)$-tuple $\mathbf{y} = (y_i)_{i \in I_\psi}$ with $y_i \in \{0, 1, \ldots, n-1\}$ satisfying

(1) $\sum_{i \in I_\psi} y_i = n - 1$, and

(2) $\sum_{j \in I_\psi} (1 - y_j)\alpha_{ij} \ge 0$ for all $i \in I_\psi$.

The set of all $(n-1)$-tuples satisfying conditions (1) and (2) is denoted $Y_\psi$.


One should interpret a population history as indicating the number of coalescent
events on each edge of a species tree that leads to a realization of some (unspecified)
gene tree. Then condition (1) of the definition is interpreted as stating that over
the full species tree all lineages ultimately coalesce into one; i.e., there are a total
of $n - 1$ coalescences.

Condition (2) requires more elucidation. First note that for $i \in I_\psi$,

$$\sum_{j \in I_\psi} \alpha_{ij} = \ell_i - 1,$$

where $\ell_i$ is the number of leaf descendants of node $i$ on $\psi$. This equivalence is due
to the number of leaf descendants of a node of a binary tree being the number of
internal descendants plus 1. As an example, for the species tree in Figure 1(A), we
have

$$\alpha_{11} = \alpha_{22} = \alpha_{31} = \alpha_{32} = \alpha_{33} = \alpha_{41} = \alpha_{42} = \alpha_{43} = \alpha_{44} = 1$$

and $\alpha_{ij} = 0$ for all other choices of $i$ and $j$. To further illustrate the example,

$$\sum_{j \in I_\psi} \alpha_{3j} = \alpha_{31} + \alpha_{32} + \alpha_{33} = 3 = 4 - 1.$$

Thus condition (2) is equivalent to

$$\ell_i - 1 \ge \sum_{j \in I_\psi} y_j \alpha_{ij}$$

for each $i \in I_\psi$. This expresses that the number of coalescent events occurring on
edges $i$ and below in the species tree cannot exceed the maximum possible for the
lineages present in that part of the tree.
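
Definition 7 is easy to check mechanically. The following sketch builds the indicator $\alpha$ for the internal nodes of the Figure 1 species tree from a parent map (our reading of the figure: nodes 1 and 2 are cherries, 3 their parent, 4 the root), tests conditions (1) and (2), and enumerates all population histories by brute force. All names are ours, and the exhaustive search is only intended for small $n$.

```python
from itertools import product

# Internal nodes of the Figure 1 species tree in postorder; node 4 is the root.
PARENT = {1: 3, 2: 3, 3: 4, 4: None}


def ancestry_indicator(parent):
    """alpha[i][j] = 1 if node i is ancestral or equal to node j, else 0."""
    alpha = {i: {j: 0 for j in parent} for i in parent}
    for j in parent:
        i = j
        while i is not None:
            alpha[i][j] = 1
            i = parent[i]
    return alpha


def is_population_history(y, alpha):
    """Check conditions (1) and (2) of Definition 7 for y (a dict node -> count)."""
    nodes = sorted(alpha)
    n = len(nodes) + 1
    if any(y[i] < 0 for i in nodes) or sum(y[i] for i in nodes) != n - 1:
        return False  # condition (1) fails
    # Condition (2): sum_j (1 - y_j) alpha_ij >= 0 for every internal node i.
    return all(sum((1 - y[j]) * alpha[i][j] for j in nodes) >= 0 for i in nodes)


def population_histories(alpha):
    """Enumerate all population histories by exhaustive search."""
    nodes = sorted(alpha)
    n = len(nodes) + 1
    candidates = (dict(zip(nodes, v)) for v in product(range(n), repeat=len(nodes)))
    return [y for y in candidates if is_population_history(y, alpha)]


ALPHA = ancestry_indicator(PARENT)

if __name__ == "__main__":
    print(sum(ALPHA[3].values()))            # number of internal nodes at or below node 3
    print(len(population_histories(ALPHA)))  # size of the set of population histories
```

For this tree the search finds 12 population histories, one fewer than the 13 coalescent histories of the matching gene tree.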

For any fixed species tree $\psi$ and gene tree $T$ there is a natural map from the
set of coalescent histories to the set of population histories, defined by 'forgetting'
which lineages coalesce. More formally,

$$\phi_{\psi,T} : H_{\psi,T} \to Y_\psi, \qquad \phi_{\psi,T}(\mathbf{h}) = \mathbf{y},$$

where

$$y_i = \sum_{j \in I_T} \delta(h_j = i)$$

is the sum of indicators.

Population histories can also be viewed as events under the coalescent model. 


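Definition 8. Given a species tree $\sigma$, we say that a population history $\mathbf{y} \in Y_\psi$
is compatible with a gene tree $T$ if they can be simultaneously realized, i.e., if
$P_\sigma(T, \mathbf{y}) > 0$. We use $Y_{\psi,T}$ to denote the set of population histories compatible
with a gene tree $T$.

Note that $Y_{\psi,T} = \phi_{\psi,T}(H_{\psi,T})$, and $P_\sigma(T, \mathbf{y}) = \sum_{\mathbf{h} \in \phi_{\psi,T}^{-1}(\mathbf{y})} P_\sigma(T, \mathbf{h})$.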

The loss of information in passing from coalescent histories to population
histories is illustrated in Figure 1. In (A) and (B), two different coalescent histories for
the matching gene tree yield the same population history. For (A), the coalescent
history is $\mathbf{h} = (4, 3, 4, 4)$ because node 2 of the gene tree coalesces in population 3
of the species tree (hence $h_2 = 3$), while all other nodes coalesce in population 4
($h_i = 4$ for $i \ne 2$). In (B), the coalescent history is $\mathbf{h} = (3, 4, 4, 4)$ since node 1 of
the gene tree coalesces in population 3 of the species tree ($h_1 = 3$). Both coalescent
histories have one coalescence in population 3 and three coalescences in population
4, making their population histories both $(0, 0, 1, 3)$.
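
The 'forgetting' map from coalescent histories to population histories amounts to counting how often each species tree node occurs in $\mathbf{h}$. A minimal sketch, with a function name of our own choosing, reproduces the two examples from Figure 1(A) and (B):

```python
from collections import Counter


def population_history(h, species_nodes):
    """Forget which lineages coalesce: y_i is the number of entries of h equal to i."""
    counts = Counter(h)
    return tuple(counts[i] for i in species_nodes)


if __name__ == "__main__":
    nodes = (1, 2, 3, 4)  # internal nodes of the Figure 1 species tree
    print(population_history((4, 3, 4, 4), nodes))  # history of panel (A)
    print(population_history((3, 4, 4, 4), nodes))  # history of panel (B)
```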

Figure 1(C) and (D) illustrate another aspect of coalescent histories and
population histories: the same numbers of events in each species tree population can
have different probabilities for different gene trees. In (C),
there are two coalescent events in population 3. For this gene tree, either the (a, b)
coalescence or the (c, d) coalescence can occur more recently within population 3
and result in the same gene tree topology, coalescent history, and population
history, but different in-population rankings. For the same population history with a
caterpillar gene tree (D), however, the gene tree topology constrains the coalescence
of lineage c to be more ancient than the coalescence of a with b. This results in a
lower probability for the same population history when the gene tree is a caterpillar
compared to the matching gene tree.

3. Results 


Our main result is the following: 

Theorem 9. For a species tree $\sigma = (\psi, \lambda)$, let $T$ be any caterpillar gene tree, with
$T \ne T_M$. Then $P_\sigma(T) < P_\sigma(T_M)$. In particular, a caterpillar is never an AGT.

As a consequence, we also obtain:

Corollary 10. There are no caterpillars in a wicked forest.

Proof. Any species tree in a wicked forest must have a topology which can be an
AGT for some other species tree defined on the same taxa. Since caterpillars cannot
be AGTs by Theorem 9, no species tree in a wicked forest can have a caterpillar
topology. □

Our proof of the theorem is built on a succession of lemmas. To simplify
statements, we assume throughout that the species tree $\sigma = (\psi, \lambda)$ has been fixed.

The first lemma is immediately clear. 

Lemma 11. The probability of a gene tree $T$ can be written as

$$P_\sigma(T) = \sum_{\mathbf{y} \in Y_{\psi,T}} P_\sigma(T, \mathbf{y}).$$

Lemma 12. The matching gene tree $T_M$ is compatible with every population
history. That is, $Y_{\psi,T_M} = Y_\psi \supseteq Y_{\psi,T}$ for every gene tree $T$.

Though the proof of this is somewhat technical, the idea behind it is simple: With
a population history $\mathbf{y}$ fixed, we pick any cherry on $T_M$, and have the coalescent
event forming that cherry occur on the edge of $\psi$ as close to the leaves as possible
among those allowed by $\mathbf{y}$. We then show that deleting the cherry from $T_M$ and $\psi$,
and the coalescent event from $\mathbf{y}$, leads to trees and a population history with one
fewer taxon, so an inductive argument gives the result.

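Proof of Lemma 12. We must show that if $\mathbf{y} \in Y_\psi$ then $P_\sigma(T_M, \mathbf{y}) > 0$. But

$$P_\sigma(T_M, \mathbf{y}) = \sum_{\mathbf{h} \in \phi_{\psi,T_M}^{-1}(\mathbf{y})} P_\sigma(T_M, \mathbf{h}),$$

so it suffices to show there is some $\mathbf{h} \in H_{\psi,T_M}$ with $\phi_{\psi,T_M}(\mathbf{h}) = \mathbf{y}$, since for such
an $\mathbf{h}$, $P_\sigma(T_M, \mathbf{h}) > 0$.

We prove this by induction on the number of taxa $n$. The base case of $n = 2$ is
clear, since there is only one gene tree $T = T_M$, and one coalescent history $\mathbf{h}$, with
$P_\sigma(T_M, \mathbf{h}) = 1$.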

Now assume the result is known for $n - 1$ taxa. For $n$-taxon trees on taxa $X$,
identify the nodes of the matching gene tree $T_M$ with those of the species tree $\psi$ so
that we may use the same notation to refer to either. Pick an internal node $v$ on $\psi$
that is parental to exactly two leaves, say $a$ and $b$. On both $\psi$ and $T_M$, prune the
edge descending from the node $v$ to leaf $a$, and then suppress that node, to obtain
matching $(n-1)$-taxon trees $\tilde{\psi}$ and $\tilde{T}_M$ on taxa $X \setminus \{a\}$. We may thus view the
node sets of the four trees as satisfying $V_{\tilde{\psi}} = V_{\tilde{T}_M} \subset V_\psi = V_{T_M}$.

We similarly relate a population history $\mathbf{y}$ on $\psi$ to a population history $\tilde{\mathbf{y}}$ on
$\tilde{\psi}$ in the following way: Let $w = w(\mathbf{y})$ be the most recent vertex on $\psi$, ancestral
or equal to population $v$, labeling a population in which a coalescent event occurs.
That is,

$$w = \min\{i \mid \alpha_{iv} = 1,\ y_i > 0\},$$

where the minimum is taken with respect to the ancestral relationship. Then let
$\tilde{\mathbf{y}} = (\tilde{y}_i)_{i \in I_{\tilde{\psi}}}$ where $\tilde{y}_i = y_i - \delta(i = w)$. (In essence, this simply removes one
coalescent event from the population above $w$, but in the case where $w = v$, so
$y_v = 1$, this is done through dropping $y_v$ from $\mathbf{y}$.)

We next verify that $\tilde{\mathbf{y}}$ is a population history for $\tilde{\psi}$, by showing it satisfies the
appropriate constraints. Clearly it has non-negative entries. That condition (1) of
Definition 7 is satisfied for $\tilde{\mathbf{y}}$ is also clear, since we have reduced the sum of the
entries in $\mathbf{y}$ by 1.

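For the inequality constraints of condition (2), first suppose $w = v$. Then for all
$j \in I_{\tilde{\psi}}$ we have $\tilde{y}_j = y_j$. Thus for $i \in I_{\tilde{\psi}}$,

$$\sum_{j \in I_{\tilde{\psi}}} (1 - \tilde{y}_j)\alpha_{ij} = \sum_{j \in I_{\tilde{\psi}}} (1 - y_j)\alpha_{ij} = -(1 - y_v)\alpha_{iv} + \sum_{j \in I_\psi} (1 - y_j)\alpha_{ij}.$$

But the first term in this last expression is 0, since $y_v = 1$, and the second is
non-negative because $\mathbf{y}$ is a population history vector for $\psi$. Thus condition (2) is
established in this case.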

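Now suppose $w$ is ancestral to $v$. Then $\tilde{y}_i = y_i$ for all $i \ne w$, while $\tilde{y}_w = y_w - 1$,
so

$$\sum_{j \in I_{\tilde{\psi}}} (1 - \tilde{y}_j)\alpha_{ij} = \alpha_{iw} + \sum_{j \in I_{\tilde{\psi}}} (1 - y_j)\alpha_{ij} = \alpha_{iw} - (1 - y_v)\alpha_{iv} + \sum_{j \in I_\psi} (1 - y_j)\alpha_{ij}.$$

By the minimality of $w$, we know $y_v = 0$. It will follow that the above expression
is non-negative in any case when $\alpha_{iw} - \alpha_{iv} \ge 0$. This is true if either $i$ is ancestral
or equal to $w$ (and hence ancestral to $v$), or $i$ is not ancestral to $v$ (and hence not
ancestral to $w$).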

The remaining subcase to consider is when $i$ is ancestral to $v$ but not ancestral
or equal to $w$, i.e., $i$ lies between $v$ and $w$. In this situation $\alpha_{iw} - \alpha_{iv} = -1$, so we
must show

$$\sum_{j \in I_\psi} (1 - y_j)\alpha_{ij} \ge 1.$$

But if $i$ has two internal nodes as children, say $k$ and $m$, then

$$\sum_{j \in I_\psi} (1 - y_j)\alpha_{ij} = (1 - y_i) + \sum_{j \in I_\psi} (1 - y_j)\alpha_{kj} + \sum_{j \in I_\psi} (1 - y_j)\alpha_{mj}.$$

The two sums on the right are non-negative, because $\mathbf{y}$ is a population history
vector for $\psi$. Since $y_i = 0$ by the minimality of $w$, we obtain the needed inequality.
The case where $i$ has only one internal node as a child is similar.

This concludes the argument that $\tilde{\mathbf{y}}$ is a valid population history for $\tilde{\psi}$.




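Since $\tilde{\mathbf{y}}$ is a population history for $\tilde{\psi}$, by the induction hypothesis there is a
coalescent history $\tilde{\mathbf{h}} \in H_{\tilde{\psi},\tilde{T}_M}$ with $\phi_{\tilde{\psi},\tilde{T}_M}(\tilde{\mathbf{h}}) = \tilde{\mathbf{y}}$. Define a coalescent history for
$T_M$ on $\psi$ by $\mathbf{h} = (h_i)_{i \in I_\psi}$ with $h_i = \tilde{h}_i$ for $i \in I_{\tilde{\psi}}$ and $h_v = w$.

To verify that $\mathbf{h} \in H_{\psi,T_M}$ we must check that it satisfies the constraints of
Definition 6. For a matching tree, condition (1) is equivalent to saying that $h_i$
must be ancestral or equal to $i$. For $i \ne v$, this follows immediately from the fact
that $\tilde{h}_i$ is ancestral or equal to $i$ on $\tilde{T}_M$. Since $h_v = w$, and $w$ is ancestral or equal
to $v$, the constraint is satisfied in all cases.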

For condition (2), we must check that if $i$ is ancestral to $j$ on $T_M$, then $h_i$
is ancestral or equal to $h_j$ on $\psi$. For $j \ne v$, this follows immediately from the
analogous property for $\tilde{\mathbf{h}}$, but for $j = v$ requires more explanation.

Suppose $i$ is ancestral to $v$ on $T_M$. Then $i$ is ancestral to leaf $b$ on $T_M$, hence $i$ is
ancestral to leaf $b$ on $\tilde{T}_M$, so $\tilde{h}_i$ must be ancestral to leaf $b$ on $\tilde{\psi}$, so $h_i$ is ancestral
to leaf $b$ on $\psi$, and hence ancestral or equal to node $v$. If $w = v$, so $h_v = v$, we are
done verifying the constraint. If $w$ is ancestral to $v$, then since $\phi_{\tilde{\psi},\tilde{T}_M}(\tilde{\mathbf{h}}) = \tilde{\mathbf{y}}$, the
minimality of $w$ ensures no entries of $\tilde{\mathbf{h}}$ are nodes between leaf $b$ and node $w$ on $\tilde{\psi}$.
Thus $h_i$ does not lie between leaf $b$ and node $w$ on $\psi$. Since $h_i$ is ancestral to $v$, it
must therefore be ancestral or equal to $w = h_v$.

Finally, observing $\phi_{\psi,T_M}(\mathbf{h}) = \mathbf{y}$ completes the proof. □


Lemma 13. If $T$ is a caterpillar gene tree, then for any population history $\mathbf{y} \in Y_{\psi,T}$,
$P_\sigma(T, \mathbf{y}) \le P_\sigma(T_M, \mathbf{y})$.


Proof. From Lemma 12, if $P_\sigma(T, \mathbf{y}) > 0$, then $P_\sigma(T_M, \mathbf{y}) > 0$ as well, so $T_M$ can be
realized with $\mathbf{y}$.

Now each of these probabilities can be expressed as a product of two terms: one
which depends only on $T$ and $\mathbf{y}$, and one which depends only on $\mathbf{y}$ and $\lambda$. More
specifically,

$$P_\sigma(T, \mathbf{y}) = R_{T,\mathbf{y}}\, f(\mathbf{y}, \lambda), \qquad P_\sigma(T_M, \mathbf{y}) = R_{T_M,\mathbf{y}}\, f(\mathbf{y}, \lambda), \tag{1}$$

where $R_{T,\mathbf{y}}$ counts the number of coalescent histories with in-population rankings
consistent with the gene tree $T$ and $\mathbf{y}$, and

$$f(\mathbf{y}, \lambda) = \prod_{i=1}^{n-1} \frac{1}{d_{j_i k_i}}\, g_{j_i k_i}(\lambda_i),$$

with $j_i = \ell_i - \sum_{m \ne i} y_m \alpha_{im}$ the number of lineages 'entering' population $i$ from
below and $k_i = j_i - y_i$ the number of lineages 'leaving' population $i$ above, $d_{jk}$
the number of sequences of coalescent events that may occur for $j$ labeled entering
lineages to coalesce to $k$ leaving lineages, and $g_{jk}(u)$ the function which gives the
probability that $j$ lineages in a population coalesce to $k$ lineages in $u$ coalescent
units (Degnan and Salter, 2005; Rosenberg, 2002; Tavaré, 1984; Wakeley, 2008).

Since $T$ is a caterpillar, its realization requires a specific ranked ordering of
coalescent events, so $R_{T,\mathbf{y}} = 1$. Since $R_{T_M,\mathbf{y}} \ge 1$, the lemma follows from equations
(1). □
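
The ingredients of the factor depending only on the population history and the branch lengths can be evaluated numerically. The sketch below implements $g_{jk}(u)$ using the standard closed form attributed to Tavaré (1984), written out here from memory rather than taken from this paper, and $d_{jk}$ as the product of the numbers of pairs available at each successive coalescence. The function names are ours, and `math.comb` and `math.prod` assume Python 3.8 or later.

```python
import math


def g(j, k, u):
    """Probability that j lineages coalesce to k lineages within u coalescent units."""
    total = 0.0
    for m in range(k, j + 1):
        rate = m * (m - 1) / 2.0
        # Guard the m = 1 term so that u = infinity does not produce 0 * inf.
        decay = 1.0 if rate == 0 else math.exp(-rate * u)
        coeff = (2 * m - 1) * (-1) ** (m - k) / (
            math.factorial(k) * math.factorial(m - k) * (k + m - 1)
        )
        for x in range(m):
            coeff *= (k + x) * (j - x) / (j + x)
        total += decay * coeff
    return total


def d(j, k):
    """Number of sequences of pairwise coalescences taking j labeled lineages to k."""
    return math.prod(math.comb(m, 2) for m in range(k + 1, j + 1))


if __name__ == "__main__":
    u = 0.05  # an arbitrary illustrative branch length
    # Population 3 of Figure 1(C)/(D): 4 lineages enter and 2 coalescences occur.
    per_ranking = g(4, 2, u) / d(4, 2)
    print("matching tree (2 rankings):", 2 * per_ranking)
    print("caterpillar   (1 ranking): ", 1 * per_ranking)
```

In the situation of Figure 1(C) and (D), four lineages enter population 3 and two coalescences occur there, so each ranked sequence has probability $g_{42}(\lambda_3)/d_{42}$ with $d_{42} = 18$; the matching tree admits two rankings and the caterpillar only one.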


Lemma 14. The population history $\mathbf{1} = (y_i)_{i \in I_\psi}$ with all $y_i = 1$ is consistent with
the matching gene tree $T_M$, but no other. That is, $\mathbf{1} \in Y_{\psi,T}$ if and only if $T = T_M$.

Proof. That $\mathbf{1} \in Y_{\psi,T_M}$ is a consequence of Lemma 12.

Suppose $\mathbf{1} \in Y_{\psi,T}$. To establish that $T = T_M$ it is enough to show these gene
trees must have the same clades (Semple and Steel, 2003). Since $P_\sigma(T, \mathbf{1}) > 0$, $T$
is realizable with one coalescent event on each internal edge of $\psi$. But for any $i$,
there are $\ell_i$ taxa descended from node $i$ of $\psi$, and population history $\mathbf{1}$ implies
$\ell_i - 1$ coalescent events occur on or below the edge above $i$. Thus for both $T$ and
$\mathbf{y} = \mathbf{1}$ to be simultaneously realized, the lineages of all taxa descended from $i$ on
$\psi$ must coalesce to form a clade on $T$. Thus every clade of $\psi$ is a clade on $T$, so
$T = T_M$. □


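Proof of Theorem 9. From the lemmas,

$$P_\sigma(T) = \sum_{\mathbf{y} \in Y_{\psi,T}} P_\sigma(T, \mathbf{y}) \le \sum_{\mathbf{y} \in Y_{\psi,T}} P_\sigma(T_M, \mathbf{y}) < \sum_{\mathbf{y} \in Y_{\psi,T_M}} P_\sigma(T_M, \mathbf{y}) = P_\sigma(T_M).$$

The first equality is from Lemma 11, the next inequality from Lemma 13, the next
from Lemmas 12 and 14, and the final equality from Lemma 11 again.

□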

Remark 15. Let $\sigma = (\psi, \lambda)$ be a species tree on taxa $X$, and let $T$ be any
nonmatching caterpillar gene tree on $X$. Then the above considerations show

$$|H_{\psi,T}| = |Y_{\psi,T}| < |Y_{\psi,T_M}| \le |H_{\psi,T_M}|, \tag{2}$$

i.e., the number of consistent coalescent histories is larger for matching trees than
for any nonmatching caterpillar tree. It has previously been shown for some species
trees that the number of coalescent histories can be larger for a nonmatching,
noncaterpillar tree than for a matching tree, although the smallest trees for which this
occurs have 7 taxa (Rosenberg and Degnan, 2010). Equation (2) shows that gene
trees with more coalescent histories than the matching tree are never caterpillars,
which presents a combinatorial analog to the result that caterpillar gene trees can
never be AGTs.
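
The chain of (in)equalities in this remark can be checked by brute force on a small example. The sketch below represents each internal node by its set of descendant leaves, enumerates coalescent histories for an arbitrary gene tree directly from Definition 6, and maps them to population histories. The species tree is our reading of Figure 1, (((a, b), (c, d)), e), and the nonmatching caterpillar ((((a, b), c), d), e) is an illustrative choice; all names are ours.

```python
from itertools import product


def clades(tree):
    """Leaf sets of the internal nodes of a nested-tuple binary tree, in postorder."""
    found = []

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = leaves(node[0]) | leaves(node[1])
        found.append(below)
        return below

    leaves(tree)
    return found


def coalescent_histories(species_tree, gene_tree):
    """All coalescent histories of gene_tree on species_tree (Definition 6)."""
    species, gene = clades(species_tree), clades(gene_tree)
    # Condition (1): a gene tree node maps only to species nodes containing its leaves.
    options = [[s for s in species if g <= s] for g in gene]
    # Condition (2): ancestors on T map to ancestral-or-equal species tree nodes.
    return [
        h
        for h in product(*options)
        if all(
            h[i] >= h[j]
            for i in range(len(gene))
            for j in range(len(gene))
            if gene[i] > gene[j]
        )
    ]


def compatible_population_histories(species_tree, gene_tree):
    """Image of the coalescent histories under the 'forgetting' map."""
    species = clades(species_tree)
    return {
        tuple(h.count(s) for s in species)
        for h in coalescent_histories(species_tree, gene_tree)
    }


SPECIES = ((("a", "b"), ("c", "d")), "e")
CATERPILLAR = (((("a", "b"), "c"), "d"), "e")

if __name__ == "__main__":
    print(len(coalescent_histories(SPECIES, CATERPILLAR)),
          len(compatible_population_histories(SPECIES, CATERPILLAR)),
          len(compatible_population_histories(SPECIES, SPECIES)),
          len(coalescent_histories(SPECIES, SPECIES)))
```

For this pair of trees the four counts should come out as 7, 7, 12, and 13, consistent with the chain of relations in the remark.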


4. Anomalous Ranked and Unrooted Gene Trees 

Recently, the concept of anomalous gene trees has also been extended to ranked
gene trees (Degnan et al., 2012a; Disanto and Rosenberg, 2014) and unrooted gene
trees (Degnan, 2013).

A ranked gene tree topology encodes the relative timing of the branches, so that,
as an example, the ranked gene tree topologies in Figure 1(A) and (B) are distinct
because the ordering of the (a, b) and (c, d) coalescences is reversed, even though
the unranked gene tree topologies are the same. An anomalous ranked gene tree
(ARGT) is a ranked gene tree that is more probable than the ranked gene tree that
matches the ranked species tree (Degnan et al., 2012a). The ranked gene tree in
Figure 1(A) matches the ranked species tree, while the ranked gene tree in Figure
1(B) does not. In spite of the results of this paper, caterpillar gene trees can be
ARGTs (Degnan et al., 2012a), i.e., a caterpillar gene tree can be more probable
than a matching ranked gene tree, even though it must be less probable than the
matching unranked gene tree.

On the other hand, neither caterpillar nor pseudo-caterpillar species trees have
ARGTs (Degnan et al., 2012a). (A pseudo-caterpillar tree is one obtained from a
caterpillar by attaching two edges to each leaf in the caterpillar's cherry. The species
trees in Figure 1 are pseudo-caterpillars.) Therefore, extending the concept of a
wicked forest to ranked gene trees, there are no caterpillars or pseudo-caterpillars
in a wicked forest for ranked gene trees (a nonempty set $W$ of distinct species trees
where the ranked topology of each member is an ARGT for all other members). An
example of a wicked forest for ranked gene trees using 10 taxa is shown in Figure
3. The smallest number of taxa needed for a wicked ranked forest is unknown.

Figure 3. A ranked wicked forest. The two trees have the same
unranked topology but have different rankings since, for example,
(d, e) has the most recent common ancestor in the left tree,
while (a, b) has the most recent common ancestor in the right tree.
Subtrees in rectangular shaded boxes have ARGTs, shown in
corresponding circular shaded regions in the facing tree. For the
subtrees in rectangular regions, there are relatively long branches
separating the two- and three-taxon clades. Subtrees that are not in
boxes have low probability for any particular sequence of
coalescences because all branches are short. These subtrees have short
branches separating two- and three-taxon clades.

The gene tree probabilities for Figure 3 are most easily approximated by
assuming that the branches between the root and the shaded regions are very long,
so that coalescence of all available lineages is virtually guaranteed on these
branches. Then probabilities for the left and right shaded subtrees can be obtained
using formulas from Degnan et al. (2012b). For the tree on the left in Figure 3,
let the subtree in the rectangular box be $\sigma_L^{\square}$, and the tree in the
circular shaded region be $\sigma_L^{\circ}$, so that the overall species tree is
$\sigma_L = (\sigma_L^{\square}{:}\lambda_1, \sigma_L^{\circ}{:}\lambda_2)$, where
$\lambda_1$ and $\lambda_2$ are very large. Similarly, the tree on the right of
Figure 3 is $\sigma_R = (\sigma_R^{\circ}{:}\lambda_3, \sigma_R^{\square}{:}\lambda_4)$.
Here $\sigma_L^{\square}$ and $\sigma_R^{\circ}$ are species trees on
$X_1 = \{a, b, c, d, e\}$ and $\sigma_L^{\circ}$ and $\sigma_R^{\square}$ are species
trees on $X_2 = \{f, g, h, i, j\}$. We let $T_L^{\square}$ and $T_L^{\circ}$ be the
matching ranked gene trees for $\sigma_L^{\square}$ and $\sigma_L^{\circ}$,
respectively, with $T_R^{\circ}$ and $T_R^{\square}$ defined analogously for
$\sigma_R^{\circ}$ and $\sigma_R^{\square}$. Let $T_L$ and $T_R$ denote the matching
ranked gene trees for the left and right trees, respectively.

Figure 4. (a) Two rooted caterpillar species trees constituting a wicked
forest for unrooted gene trees. For these species trees, the two shorter
internal branches have branch length 0.05 coalescent units while the longer
one has length 0.5 coalescent units. (b) The unrooted gene tree on the left
is the most probable unrooted gene tree given the species tree on the left
in (a). The unrooted gene tree on the right in (b) is the most probable
unrooted gene tree given the species tree on the right in (a).

From Degnan et al. (2012b), branch lengths can be chosen so that if $\sigma_L$ is
the species tree, then with probability arbitrarily close to 2/8, the ranked gene
tree restricted to taxa $X_1$ is $T_L^{\square}$, and with probability arbitrarily
close to 3/8 it is $T_R^{\circ}$. Branch lengths can also be chosen so that the
ranked gene tree restricted to taxa $X_2$ has nearly equal probability of being
either $T_L^{\circ}$ or $T_R^{\square}$. Therefore, for some choices of branch lengths,

$$\frac{P_{\sigma_L}(T_R)}{P_{\sigma_L}(T_L)} \approx \frac{3}{2}.$$

Similar arguments show that $T_L$ can be approximately 1.5 times as probable as
$T_R$ when $\sigma_R$ is the species tree. In this example, the wicked forest
contains two species trees with identical unranked topologies but different ranked
topologies. Examples of wicked forests for ranked gene trees that contain trees
with different topologies can also be constructed. For example, one could swap
taxa b and c in one of the two species trees but not the other and still obtain a
wicked forest for ranked gene trees.
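
The factor of about 1.5 between the two ranked gene trees follows from the
limiting values 2/8 and 3/8 quoted above. The sketch below checks the
arithmetic exactly. It assumes, purely for illustration, that the probability
of a full ranked gene tree factors into a term for the taxa a to e and a term
for the taxa f to j, and that the second term is the same for both gene
trees; the placeholder `q` and the variable names are not from the paper.

```python
from fractions import Fraction

# Limiting probabilities quoted in the text for the gene tree restricted to
# taxa a-e when the left tree of Figure 3 is the species tree.
p_matching = Fraction(2, 8)    # restricted gene tree matches the left tree
p_anomalous = Fraction(3, 8)   # restricted gene tree matches the right tree

# Assumption: the factor for taxa f-j is (nearly) equal for both gene trees.
# Its value is an arbitrary placeholder and cancels in the ratio.
q = Fraction(1, 10)

ratio = (p_anomalous * q) / (p_matching * q)
print(ratio)  # 3/2
```

Because the shared factor cancels, the ratio depends only on the two limiting
values for the five-taxon subtree.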

Probabilities of unrooted gene trees can be obtained by summing over the
probabilities of all rooted gene trees with the same unrooted topology. An unrooted
caterpillar tree is a binary tree where every internal node is connected by an edge
to a leaf node. Unrooted caterpillar gene trees can be anomalous unrooted gene
trees (AUGTs), i.e., more probable than the unrooted gene tree with the same
unrooted topology as that of the species tree. Figure 4 shows a wicked forest for
unrooted gene trees, which we define as a nonempty set W of rooted species trees
such that for $\sigma_i, \sigma_j \in W$, $P_{\sigma_i}(u(T_j)) > P_{\sigma_i}(u(T_i))$
for $i \neq j$, where $u(T_i)$ is the unrooted topology of $T_i$, and $T_i$ has the
same rooted topology as $\sigma_i$. This example shows that caterpillars can be in a
wicked forest for unrooted trees.
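
The summation over rooted gene trees described above can be sketched in a few
lines of Python. This is an illustrative sketch, not the authors' code: trees
are assumed to be nested tuples of leaf names, an unrooted topology is
represented by its set of nontrivial splits, and the probabilities in the
example are made up. The caterpillar test relies on the standard fact that an
unrooted binary tree on four or more taxa is a caterpillar exactly when it has
two cherries.

```python
from collections import defaultdict


def leaves(tree):
    """Leaf labels of a rooted tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def unrooted_topology(tree):
    """Set of nontrivial splits, which identifies the unrooted topology."""
    taxa = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        # Keep only splits with at least two taxa on each side.
        if 2 <= len(clade) <= len(taxa) - 2:
            splits.add(frozenset([clade, taxa - clade]))
        for child in node:
            visit(child)

    visit(tree)
    return frozenset(splits)


def unrooted_probabilities(rooted_probs):
    """Sum rooted gene tree probabilities over each unrooted topology."""
    totals = defaultdict(float)
    for tree, p in rooted_probs.items():
        totals[unrooted_topology(tree)] += p
    return dict(totals)


def is_unrooted_caterpillar(topology):
    """For four or more taxa: caterpillar iff there are exactly two cherries."""
    cherries = {side for split in topology for side in split if len(side) == 2}
    return len(cherries) == 2


# Made-up probabilities for three rooted four-taxon gene trees.
rooted = {
    ((("a", "b"), "c"), "d"): 0.20,
    (("a", "b"), ("c", "d")): 0.10,   # same unrooted topology as the first
    ((("a", "c"), "b"), "d"): 0.05,
}
for topology, p in unrooted_probabilities(rooted).items():
    print(sorted(sorted(side) for split in topology for side in split), p)
```

Representing an unrooted topology by its splits makes the root position
irrelevant, so rooted trees that differ only in where the root sits are
grouped together automatically.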
5. Future work on AGTs 


The fact that caterpillar gene trees cannot be AGTs fits the intuition that AGTs
are more easily found among gene trees with more balanced topology than the
species tree (Degnan, 2013; Rosenberg, 2013). For unbalanced species trees,
choosing sufficiently short branch lengths gives greater probability to gene trees
with more tree balance (Degnan and Rosenberg, 2006). However, the fact that
even perfectly balanced species trees can have AGTs (Degnan and Rosenberg, 2006)
suggests that it is difficult to characterize all AGTs. Thus, there is still an open
question: for a given species tree topology, which gene tree topologies can be AGTs?

The strategy of Degnan (2013) can be used to predict many of the AGTs for
a given species tree: First one considers a smaller species tree induced by taking
a subset of taxa. If this smaller tree has AGTs, then ones for the larger tree
can be predicted by re-grafting the removed taxa onto the smaller AGTs. As
an example, for the species tree (((a, b), (c, d)), e), called a pseudo-caterpillar by
Rosenberg (2007), removing taxon c results in the caterpillar (((a, b), d), e), which
can have AGTs ((a, b), (d, e)), ((a, d), (b, e)), and ((a, e), (b, d)). Placing c back
on these AGTs results in ((a, b), ((c, d), e)), ((a, (c, d)), (b, e)) and ((a, e), (b, (c, d))).
While this perhaps suggests that the 5-taxon pseudo-caterpillar species tree cannot
have a pseudo-caterpillar AGT, the verification of that fact currently depends on a
detailed calculation of gene tree probabilities (Rosenberg and Tao, 2008).
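
The prune-and-regraft step in this example can be reproduced mechanically.
The sketch below is only an illustration of the worked example, assuming trees
are written as nested tuples of leaf names; the helper names `prune` and
`regraft` are hypothetical, and reattaching c as the sister of d is read off
the example rather than taken from a general procedure in Degnan (2013).

```python
def prune(tree, taxon):
    """Remove a leaf and suppress the node left with a single child."""
    if isinstance(tree, str):
        return None if tree == taxon else tree
    kept = [c for c in (prune(child, taxon) for child in tree) if c is not None]
    return kept[0] if len(kept) == 1 else tuple(kept)


def regraft(tree, taxon, sister):
    """Reattach `taxon` as the sister of the leaf `sister`."""
    if isinstance(tree, str):
        return (taxon, sister) if tree == sister else tree
    return tuple(regraft(child, taxon, sister) for child in tree)


species_tree = ((("a", "b"), ("c", "d")), "e")
print(prune(species_tree, "c"))  # ((('a', 'b'), 'd'), 'e')

# AGTs of the four-taxon caterpillar, as listed in the text.
small_agts = [
    (("a", "b"), ("d", "e")),
    (("a", "d"), ("b", "e")),
    (("a", "e"), ("b", "d")),
]
for agt in small_agts:
    print(regraft(agt, "c", "d"))
```

The three trees printed by the loop are the predicted AGTs listed in the text,
and none of them has the pseudo-caterpillar shape of the species tree.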

While it would be desirable to have an efficient way of determining which
topologies can be AGTs for a given species tree, potentially more valuable would
be methods for determining the set of species trees for which a given gene tree
can be most probable. Such candidate species trees could then be used to reduce
the search space for the optimal species trees to explain a set of gene trees
(Fan and Kubatko, 2011).


Further results on AGTs may also be helpful in interpreting results of species
tree inference by concatenation of gene sequences. In particular, simulations
(Kubatko and Degnan, 2007; DeGiorgio and Degnan, 2010) as well as theoretical
results (?) have shown that when maximum likelihood is used to infer a tree
based on concatenated DNA sequences, the inferred tree can be misleading, in the
sense that concatenating more genes can be more likely to lead to an erroneous
inferred species tree. In simulations where concatenation has been misleading, the
returned tree is often an AGT. Simulations also suggest that concatenation performs
better when the true species tree is balanced (Leaché and Rannala, 2011), and thus
AGTs are less common (Degnan and Rosenberg, 2006; Rosenberg and Tao, 2008;
Degnan, 2013). Studies are needed to determine whether in larger trees inferred
from empirical data, certain tree shapes inferred from concatenation tend to be
more reliable than others.